3. Building an FTS Engine on Azure
That was quite a bit of theory to set up what you will do next:
build your own FTS engine on Windows Azure storage.
3.1. Picking a data source
The first thing you need is some data to index and search.
Feel free to use any data you have lying around. The code you are
about to see should work on any set of text files.
To find a good source of sample data, let’s turn to an easily
available and widely used one: Project Gutenberg. This project provides
thousands of free books online in several accessible formats under a
very liberal license. You can download your own copies from http://www.gutenberg.org. If you’re feeling lazy, you
can download the exact Gutenberg book files that have been used here
from http://www.sriramkrishnan.com/windowsazurebook/gutenberg.zip.
Why use plain-text files and not some structured data? There is
no reason, really. You can easily modify the code samples you’re about
to see and import some structured data, or data from a custom data
source.
3.2. Setting up the project
To keep this sample as simple as possible, let’s build a
basic console application. This console application will perform only
two tasks. First, when pointed to a set of files in a directory, it
will index them and create the inverted index in Windows Azure
storage. Second, when given a search query, it will search the index
in Azure storage. Sounds simple, right?
Create a .NET 3.5 Console Application project using Visual
Studio. In this sample, call the project FTS, which makes the project’s namespace
FTS by default. If you’re
calling your project by a different name, remember to fix the
namespace.
Add references to the assemblies System.Data.Services.dll and System.Data.Services.Client.dll. This
brings in the assemblies you need for ADO.NET Data Services
support.
Bring in the Microsoft.WindowsAzure.StorageClient
library to talk to Azure storage.
Set up the configuration file with the right account name
and shared key by adding a new App.config to the project and entering
the following contents. Remember to fill in your account name and
key, and keep the connection string on a single line:
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="DataConnectionString"
         value="AccountName=YourAccountName;AccountKey=YourAccountKey==;DefaultEndpointsProtocol=https" />
  </appSettings>
  <system.net>
    <settings>
      <servicePointManager expect100Continue="false" useNagleAlgorithm="false" />
    </settings>
  </system.net>
</configuration>
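To make the shape of that connection string concrete, here is a minimal sketch of how such a string breaks down into named settings. It is only an illustration and is not how the StorageClient library parses it; in the application itself, CloudStorageAccount.Parse does this work. The ConnectionStringSketch class and its ParseSettings method are hypothetical names. Each pair is split on the first '=' only, because account keys typically end in '=' padding.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; the sample itself relies on
// CloudStorageAccount.Parse to interpret the connection string.
public static class ConnectionStringSketch
{
    // Splits "Name=Value;Name=Value" into a dictionary of settings.
    public static Dictionary<string, string> ParseSettings(string connectionString)
    {
        var settings = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var pairs = connectionString.Split(
            new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (var pair in pairs)
        {
            // Split on the first '=' only, so trailing '=' padding in the
            // account key stays part of the value.
            var separator = pair.IndexOf('=');
            if (separator <= 0)
            {
                throw new FormatException("Expected Name=Value but found: " + pair);
            }

            var name = pair.Substring(0, separator).Trim();
            var value = pair.Substring(separator + 1).Trim();
            settings[name] = value;
        }

        return settings;
    }
}
```

Calling ConnectionStringSketch.ParseSettings on the value from App.config yields three settings: AccountName, AccountKey, and DefaultEndpointsProtocol.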
3.3. Modeling the data
As you learned earlier, you must create two key data
structures. The first is a mapping between document IDs and documents.
You will be storing that in a table in Azure storage. To do that, you
use the following wrapper class inherited from TableServiceEntity. Add the following code
as Document.cs to your project:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Services;
using System.Data.Services.Client;
using Microsoft.WindowsAzure.StorageClient;

namespace FTS
{
    public class Document : TableServiceEntity
    {
        // The document ID serves as both partition key and row key
        public Document(string title, string id) : base(id, id)
        {
            this.Title = title;
            this.ID = id;
        }

        public Document() : base()
        {
            // Empty constructor for ADO.NET Data Services
        }

        public string Title { get; set; }
        public string ID { get; set; }
    }
}
This class wraps around an “entity” (row) in a Document table. Every entity has a unique
ID, and a title that corresponds to the title of the book you are
storing. In this case, you are going to show only the title in the
results, so you’ll be storing only the title in Azure storage. If you
wanted, you could choose to store the contents of the books
themselves, which would let you show book snippets in the results. You
use the document ID as the partition key, which will place every
document in a separate partition. This provides optimum performance
because you can always specify the exact partition you want to access
when you write your queries.
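Because the document ID doubles as both partition key and row key, looking up a document amounts to a point query on those two properties. The sketch below mimics that filter with LINQ to Objects over an in-memory list, so it runs without a storage account. DocumentRow is a hypothetical stand-in for the Document entity above, DocumentLookupSketch is a hypothetical helper, and the IDs used with it are made-up sample values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Document entity; it carries the same
// keys and properties but does not derive from TableServiceEntity.
public class DocumentRow
{
    public DocumentRow(string title, string id)
    {
        PartitionKey = id;   // one partition per document
        RowKey = id;
        Title = title;
        ID = id;
    }

    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public string Title { get; set; }
    public string ID { get; set; }
}

public static class DocumentLookupSketch
{
    // Returns the title for the given document ID, or null if it is unknown.
    public static string FindTitle(IEnumerable<DocumentRow> table, string id)
    {
        return (from d in table
                where d.PartitionKey == id && d.RowKey == id
                select d.Title).FirstOrDefault();
    }
}
```

Given a list holding new DocumentRow("Moby Dick", "doc-1"), a call to DocumentLookupSketch.FindTitle(list, "doc-1") returns the title, and an unknown ID returns null.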
The second key data structure you need is an inverted index. As
discussed earlier, an inverted index stores a mapping between index
terms and documents. To make this indexing easier, you use a small
variant of the design you saw in Table 11-2.
In that table, you saw every index term being unique and mapping
to a list of document IDs. Here, you have a different table entry for
every index term-document ID pair. This provides a lot of flexibility.
For example, if you move to a parallel indexing model, you can add
term-to-document ID mappings without worrying about trampling over a
concurrent update.
Save the following code as IndexEntry.cs and add it to your project:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Services;
using System.Data.Services.Client;
using Microsoft.WindowsAzure.StorageClient;

namespace FTS
{
    public class IndexEntry : TableServiceEntity
    {
        public IndexEntry(string term, string docID)
            : base(term, docID)
        {
            this.Term = term;
            this.DocID = docID;
        }

        public IndexEntry()
            : base()
        {
            // Empty constructor for ADO.NET Data Services
        }

        public string Term { get; set; }
        public string DocID { get; set; }
    }
}
At this point, you might be wondering how you can quickly get a list of
documents in which a term appears using this
design. Note that, in the code, all entries with
the same term go into the same partition, because you use the term
as the partition key. To get a list of documents in which a
term appears, you just query for all entities within that term’s
partition.
This is easier to understand with the help of a picture. Figure 1 shows the index table
containing the mappings for two terms, foo and
bar. Since each term gets its own partition, the
index table has two partitions. Each partition has several entries,
each corresponding to a document in which the term appears.
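The layout in Figure 1 can be imitated in memory to see how lookups work. In this sketch, which is an illustration and not the storage code itself, IndexRow is a hypothetical stand-in for IndexEntry, and InMemoryIndex is a hypothetical class that treats a plain list as the index table. Adding a mapping only appends a (term, document ID) row, and finding the documents for a term filters on the partition key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for IndexEntry: PartitionKey is the term,
// RowKey is the document ID.
public class IndexRow
{
    public IndexRow(string term, string docID)
    {
        PartitionKey = term;
        RowKey = docID;
    }

    public string PartitionKey { get; private set; }
    public string RowKey { get; private set; }
}

// Hypothetical in-memory imitation of the index table.
public class InMemoryIndex
{
    private readonly List<IndexRow> rows = new List<IndexRow>();

    // Adds one term-to-document mapping. A table holds at most one entity
    // per (PartitionKey, RowKey) pair, so a repeated pair is skipped here.
    public void Add(string term, string docID)
    {
        bool exists = rows.Any(r => r.PartitionKey == term && r.RowKey == docID);
        if (!exists)
        {
            rows.Add(new IndexRow(term, docID));
        }
    }

    // Returns the IDs of all documents in the term's partition,
    // sorted by ID so the output is stable.
    public List<string> DocumentsFor(string term)
    {
        return (from r in rows
                where r.PartitionKey == term
                orderby r.RowKey
                select r.RowKey).ToList();
    }
}
```

After adding ("foo", "doc-1"), ("foo", "doc-3"), and ("bar", "doc-2"), DocumentsFor("foo") returns doc-1 and doc-3, while a term that was never indexed returns an empty list. No existing row is read or rewritten when a new pair is added, which is what makes this layout friendly to parallel indexing.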
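Finally, you need a data service context class derived from
TableServiceContext. It wraps the two classes you just
wrote and enables you to query them through ADO.NET Data
Services. Add the following class to your project: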
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using System.Data.Services.Client;

namespace FTS
{
    public class FTSDataServiceContext : TableServiceContext
    {
        public FTSDataServiceContext(string baseAddress,
            StorageCredentials credentials)
            : base(baseAddress, credentials)
        {
        }

        public const string DocumentTableName = "DocumentTable";

        public IQueryable<Document> DocumentTable
        {
            get
            {
                return this.CreateQuery<Document>(DocumentTableName);
            }
        }

        public const string IndexTableName = "IndexTable";

        public IQueryable<IndexEntry> IndexTable
        {
            get
            {
                return this.CreateQuery<IndexEntry>(IndexTableName);
            }
        }
    }
}
3.4. Adding a mini console
The following trivial helper code enables you to test out
various text files and search for various terms. Replace your
Program.cs with the following code. This essentially lets you call
out to an Index method or a
Search method based on whether you
enter index or search in the console. You’ll be writing
both very soon, so let’s just leave stub implementations for
now:
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Text;
using System.IO;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure;

namespace FTS
{
    class Program
    {
        static void Main(string[] args)
        {
            CreateTables();
            Console.WriteLine(
                "Enter command - 'index <directory-path>' or 'search <query>' or 'quit'");
            while (true)
            {
                Console.Write(">");
                var command = Console.ReadLine();
                if (command == null)
                {
                    return; // End of input
                }
                if (command.StartsWith("index"))
                {
                    // Everything after the command word is the directory path
                    var path = command.Substring("index".Length).Trim();
                    Index(path);
                }
                else if (command.StartsWith("search"))
                {
                    // Everything after the command word is the query
                    var query = command.Substring("search".Length).Trim();
                    Search(query);
                }
                else if (command.StartsWith("quit"))
                {
                    return;
                }
                else
                {
                    Console.WriteLine("Unknown command");
                }
            }
        }

        // Stub implementations for now
        static void Index(string path) { }
        static void Search(string query) { }
    }
}
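The command handling above boils down to splitting each input line into a command word and its argument. A minimal sketch of that step follows. CommandParser and ParsedCommand are hypothetical names that are not part of the sample, and splitting on the first space is just one reasonable approach; unlike a prefix check, it treats the command word as a whole token.

```csharp
using System;

// Hypothetical types for illustration; they are not part of the sample.
public class ParsedCommand
{
    public string Name { get; set; }      // e.g. "index", "search", "quit"
    public string Argument { get; set; }  // text after the command, may be empty
}

public static class CommandParser
{
    // Splits a console line into a lowercase command name and a trimmed argument.
    public static ParsedCommand Parse(string line)
    {
        var trimmed = (line ?? string.Empty).Trim();
        var space = trimmed.IndexOf(' ');

        if (space < 0)
        {
            // No argument present, e.g. "quit"
            return new ParsedCommand
            {
                Name = trimmed.ToLowerInvariant(),
                Argument = string.Empty
            };
        }

        return new ParsedCommand
        {
            Name = trimmed.Substring(0, space).ToLowerInvariant(),
            Argument = trimmed.Substring(space + 1).Trim()
        };
    }
}
```

For example, CommandParser.Parse("search white whale") gives the name "search" and the argument "white whale", and a bare "quit" gives an empty argument.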
3.5. Creating the tables
At the top of Main, you see
a CreateTables method call.
As the name implies, this creates your tables in Azure
table storage if they don’t already exist. To do that, add the
following code below Main in
Program.cs:
static void CreateTables()
{
    // Read the connection string from App.config
    var account = CloudStorageAccount.Parse(
        System.Configuration.ConfigurationSettings.AppSettings["DataConnectionString"]);
    var tableClient = account.CreateCloudTableClient();

    // Create each table only if it doesn't already exist
    tableClient.CreateTableIfNotExist(FTSDataServiceContext.IndexTableName);
    tableClient.CreateTableIfNotExist(FTSDataServiceContext.DocumentTableName);
}